Positional Encoding and Representation Geometry in Masked Time Series Transformers

Timothy J. Gardner

February 2026 — Draft

Abstract We study how positional encoding schemes shape the internal representations learned by BERT-style masked prediction transformers on continuous time series. Using a synthetic dataset of Markov-switching circular trajectories in high-dimensional space, we compare three positional encoding strategies—sinusoidal (absolute), T5-style learned relative bias, and Rotary Position Embedding (RoPE)—in terms of convergence speed, final reconstruction accuracy, and the intrinsic dimensionality of intermediate representations. We find that relative positional encodings (T5 and RoPE) converge substantially faster and produce smoother, more monotonic dimensionality compression across transformer layers. RoPE achieves the fastest convergence (reaching the noise floor in ~50 epochs vs ~200 for sinusoidal), the lowest validation loss, and the tightest final intrinsic dimension (1.9 vs 2.6 for sinusoidal at $k=30$). These results suggest that the choice of positional encoding has a significant effect not just on training dynamics, but on the geometric organisation of learned representations.

1. Introduction

Transformer architectures have become the dominant approach for sequence modelling, but their mechanisms for encoding positional information vary widely. The original transformer [1] used fixed sinusoidal encodings added to input embeddings. Subsequent work introduced relative position schemes: T5 [2] uses learned additive biases on attention logits indexed by bucketed relative distance, while Rotary Position Embedding (RoPE) [3] encodes relative position through rotations applied to query and key vectors.

While these schemes have been extensively benchmarked on NLP tasks, less is known about how they affect the geometry of learned internal representations—particularly in continuous-valued time series settings where the signal has known low-dimensional manifold structure. This work uses a synthetic time series with precisely controlled geometric properties to study how positional encoding influences both training dynamics and the intrinsic dimensionality of representations at each transformer layer.

2. Synthetic Dataset

We construct a continuous time series that mimics syllable-structured sequential data (e.g. birdsong). A point traverses one of $C = 10$ circles embedded in $\mathbb{R}^{D}$ with $D = 20$, switching between circles according to a sparse Markov transition matrix with ring connectivity and long-range shortcuts.

2.1 Circle construction

Each circle $c \in \{1, \ldots, C\}$ is defined by a 2D plane $\text{span}(\mathbf{u}_c, \mathbf{v}_c)$ in $\mathbb{R}^D$, where $\mathbf{u}_c, \mathbf{v}_c$ are orthonormal vectors. The trajectory on circle $c$ at angle $\theta$ is:

$$\mathbf{x}_c(\theta) = r_c \left( \cos\theta \cdot \mathbf{u}_c + \sin\theta \cdot \mathbf{v}_c \right)$$

where $r_c$ is the radius. Each circle has a fixed angular velocity $\omega_c$, with periods ranging from 40 to 400 time steps (a 10× speed range). Dwell times on each circle average ~400 steps, quantised to whole revolutions so that entry and exit angles are consistent.
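The construction above can be sketched in a few lines of numpy. The function names here are illustrative, not taken from the authors' code; orthonormal plane vectors are obtained via QR decomposition of a random Gaussian matrix.

```python
import numpy as np

def make_circle_basis(D, rng):
    """Draw two random orthonormal vectors u, v in R^D via QR decomposition."""
    A = rng.standard_normal((D, 2))
    Q, _ = np.linalg.qr(A)
    return Q[:, 0], Q[:, 1]

def circle_trajectory(u, v, radius, period, n_steps):
    """Trace x(theta) = r (cos(theta) u + sin(theta) v) at a fixed angular
    velocity of one revolution per `period` time steps."""
    theta = 2 * np.pi * np.arange(n_steps) / period
    return radius * (np.outer(np.cos(theta), u) + np.outer(np.sin(theta), v))

rng = np.random.default_rng(0)
u, v = make_circle_basis(20, rng)
traj = circle_trajectory(u, v, radius=3.0, period=40, n_steps=40)  # one revolution
```

Because u and v are orthonormal, every point of the trajectory sits at exactly `radius` from the origin.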

2.2 Geometric overlap control

The degree of geometric overlap between circles is controlled by constraining the 2D planes to a shared subspace of dimension $d_{\text{sub}} \leq D$. When $d_{\text{sub}} = D = 20$, each circle occupies a nearly orthogonal plane, producing minimal overlap. When $d_{\text{sub}} = 4$, ten 2D planes are forced into a 4D subspace, creating significant trajectory overlap—the model must rely on temporal dynamics rather than instantaneous geometry to distinguish circles.
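One way to realise this constraint (a sketch, with hypothetical function names) is to first draw an orthonormal basis for the shared $d_{\text{sub}}$-dimensional subspace and then sample each circle's 2D plane inside it:

```python
import numpy as np

def circle_planes(D, C, d_sub, rng):
    """Sample C orthonormal 2D plane bases whose span is confined to a
    d_sub-dimensional subspace of R^D (d_sub <= D)."""
    # Orthonormal basis for the shared subspace, via QR
    S, _ = np.linalg.qr(rng.standard_normal((D, d_sub)))
    planes = []
    for _ in range(C):
        # Random 2D plane inside the subspace, orthonormalised by QR
        B, _ = np.linalg.qr(rng.standard_normal((d_sub, 2)))
        planes.append(S @ B)  # (D, 2): columns are u_c, v_c
    return planes

rng = np.random.default_rng(1)
planes = circle_planes(D=20, C=10, d_sub=4, rng=rng)
```

Stacking all ten plane bases side by side gives a matrix of rank at most $d_{\text{sub}} = 4$, confirming the overlap constraint.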

2.3 Noise model

Isotropic Gaussian noise is added to every observation: $\mathbf{y}_t = \mathbf{x}_t + \boldsymbol{\epsilon}_t$ with $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_D)$. We set $\sigma = 2.83$, yielding SNR $\approx 2.5$. The noise floor for masked MSE loss is $\sigma^2 = 8.0$.



Figure 1. Left: A sample window of the 20D time series (heatmap) with Markov state labels (top strip). Right: UMAP of the raw data with $d_{\text{sub}} = 20$ showing 10 well-separated circular manifolds.

3. Model Architecture

We train a BERT-style [4] masked prediction model adapted for continuous time series. The architecture is:

$$\text{Input} \; (B, T, 20) \;\xrightarrow{\text{mask}}\; \text{Linear}(20 \to d) \;\xrightarrow{\text{PE}}\; N \times \text{TransformerEncoder} \;\xrightarrow{}\; \text{MLP}(d \to 20)$$

where $d = 128$, $N = 7$ layers, 4 attention heads, and FFN dimension 512 with GELU activation. Masked positions (15% of each window) are replaced with a learnable [MASK] embedding before projection. The loss is MSE computed only on masked positions:

$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{t \in \mathcal{M}} \| \hat{\mathbf{y}}_t - \mathbf{y}_t \|^2$$

where $\mathcal{M}$ is the set of masked time indices. Masking uses stochastic contiguous patches with sizes drawn uniformly from $[8, 128]$ time steps.
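The patch-masking strategy and the masked-only loss can be sketched as follows (hypothetical helper names; the loop keeps adding patches until roughly 15% of the window is covered, so the realised fraction can slightly overshoot):

```python
import numpy as np

def sample_patch_mask(T, mask_frac=0.15, patch_range=(8, 128), rng=None):
    """Mask ~mask_frac of a length-T window using contiguous patches whose
    sizes are drawn uniformly from patch_range."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(T, dtype=bool)
    target = int(mask_frac * T)
    while mask.sum() < target:
        size = rng.integers(patch_range[0], patch_range[1] + 1)
        start = rng.integers(0, max(T - size, 1))
        mask[start:start + size] = True
    return mask

mask = sample_patch_mask(1024, rng=np.random.default_rng(0))

# Masked MSE (the loss above): squared error summed over channels,
# averaged over masked time steps only. Dummy targets for illustration:
y = np.zeros((1024, 20))
y_hat = np.ones((1024, 20))
loss = ((y_hat - y) ** 2).sum(-1)[mask].mean()  # each masked step contributes 20.0
```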

3.1 Positional encoding schemes

We compare three positional encoding strategies:

Sinusoidal (absolute). Following Vaswani et al. [1], fixed sinusoidal encodings are added to the input embeddings:

$$\text{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \quad \text{PE}(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$
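The table of encodings can be computed directly from this formula; a minimal numpy sketch:

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Fixed sinusoidal encodings: sin on even indices, cos on odd indices,
    with frequencies 1 / 10000^(2i/d)."""
    t = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = t / (10000.0 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(1024, 128)  # matches the paper's T = 1024, d = 128
```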

T5 relative bias (learned). Following Raffel et al. [2], a learned per-head bias $b_h(i - j)$ is added to the attention logits before softmax. Relative distances are mapped to $B = 64$ buckets using a logarithmic scheme: exact buckets for small distances, log-spaced buckets for distances up to $d_{\max} = 1024$. No positional information is added to the input embeddings.
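The bucketing can be sketched as below. This follows the scheme described by Raffel et al. (half the buckets per sign, exact buckets for small distances, log-spaced buckets up to $d_{\max}$) but is a reconstruction, not the authors' exact code:

```python
import numpy as np

def relative_position_bucket(rel_pos, num_buckets=64, max_distance=1024):
    """Map signed relative distances i - j to bucket indices in [0, num_buckets):
    half the buckets for each sign, exact buckets for small |distance|,
    log-spaced buckets out to max_distance."""
    rel_pos = np.asarray(rel_pos)
    num_buckets //= 2  # split buckets between negative and positive offsets
    ret = np.where(rel_pos > 0, num_buckets, 0)
    n = np.abs(rel_pos)
    max_exact = num_buckets // 2
    # Log-spaced buckets for distances in [max_exact, max_distance)
    log_bucket = max_exact + (
        np.log(np.maximum(n, 1) / max_exact)
        / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(int)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    return ret + np.where(n < max_exact, n, log_bucket)

buckets = relative_position_bucket(np.array([0, 5, -5, 1023, -1023]))
```

A learned embedding table of shape (num_buckets, num_heads) indexed by these buckets then supplies the additive bias $b_h(i - j)$.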

RoPE (rotary). Following Su et al. [3], queries and keys are rotated by position-dependent angles before the dot product. For head dimension $d_h$ and position $t$, the rotation frequencies are $\theta_i = 10000^{-2i/d_h}$. The rotation is applied to pairs of dimensions:

$$\text{RoPE}(\mathbf{q}, t)_{2i:2i+1} = \begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$

This ensures that $\langle \text{RoPE}(\mathbf{q}, t), \text{RoPE}(\mathbf{k}, s) \rangle$ depends only on $t - s$, encoding relative position directly into the attention geometry. RoPE requires a custom attention module; we implement this using PyTorch's scaled_dot_product_attention for flash-attention compatibility.
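The relative-position property can be checked numerically. The sketch below applies the rotation to consecutive dimension pairs and verifies that the dot product is unchanged when both positions are shifted by the same offset:

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles
    t * theta_i, with theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = t * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(32), rng.standard_normal(32)
# <RoPE(q, t), RoPE(k, s)> depends only on t - s:
a = rope(q, 7) @ rope(k, 3)      # t - s = 4
b = rope(q, 104) @ rope(k, 100)  # t - s = 4, shifted by 97
```

Since each 2x2 block is a pure rotation, RoPE also preserves vector norms, so attention logit scales are unaffected.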

4. Training Protocol

All models use identical hyperparameters: AdamW optimiser with peak learning rate $3 \times 10^{-4}$, weight decay 0.01, linear warmup for 20 epochs followed by cosine decay over 500 total epochs. Batch size is 128, sequence length is 1024 time steps with stride 64. Training uses BF16 mixed precision and torch.compile on an NVIDIA RTX 5090 GPU. The dataset consists of 200,000 time steps with a 90/10 train/validation split.

5. Representation Analysis Methods

After training, we extract intermediate representations from each transformer layer by running the full dataset through the model without masking. We analyse these representations using two complementary methods:

UMAP visualisation. We use Uniform Manifold Approximation and Projection [5] to produce 2D embeddings of each layer's output, coloured by circle identity. This reveals qualitative structure: when circles separate into distinct clusters, the model has learned to disentangle the latent states.

Levina-Bickel intrinsic dimension estimation. We estimate the intrinsic dimensionality of each layer's representation using the maximum-likelihood estimator of Levina and Bickel [6]. For a point $\mathbf{x}$ and its $k$-th nearest neighbour at distance $R_k(\mathbf{x})$:

$$\hat{m}_k(\mathbf{x}) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{R_k(\mathbf{x})}{R_j(\mathbf{x})} \right]^{-1}$$

The global estimate averages over all points. We report estimates at $k \in \{10, 30, 100\}$ to capture structure at different scales: small $k$ reflects local geometry while large $k$ captures global manifold structure.
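The estimator is straightforward to implement; a brute-force sketch (exact pairwise distances, no KD-tree), sanity-checked on a noiseless circle embedded in $\mathbb{R}^{20}$, whose true intrinsic dimension is 1:

```python
import numpy as np

def levina_bickel(X, k=30):
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points."""
    # Pairwise squared distances; drop the self-distance (column 0) when sorting
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    R = np.sqrt(np.sort(D2, axis=1)[:, 1:k + 1])  # distances to k nearest neighbours
    # m_hat(x) = [ (1/(k-1)) * sum_{j=1}^{k-1} log(R_k / R_j) ]^{-1}
    m = 1.0 / np.log(R[:, -1:] / R[:, :-1]).mean(axis=1)
    return m.mean()

# Sanity check: 500 points on a circle in R^20
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 500)
Q, _ = np.linalg.qr(rng.standard_normal((20, 2)))
X = np.cos(theta)[:, None] * Q[:, 0] + np.sin(theta)[:, None] * Q[:, 1]
est = levina_bickel(X, k=10)  # should be close to 1
```

Note the estimator has a known small-$k$ positive bias (roughly a factor $(k-1)/(k-2)$), so estimates slightly above the true dimension are expected.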

6. Results

6.1 Training dynamics

The three positional encoding schemes exhibit markedly different convergence behaviour, despite identical architectures and hyperparameters.


Figure 2. Training loss curves for sinusoidal (left), T5 relative bias (centre), and RoPE (right). All runs use identical architecture, data, and hyperparameters; only the positional encoding differs.

Metric                  Sinusoidal            T5      RoPE
Best val MSE            8.06                  8.03    7.996
Epochs to noise floor   ~200                  ~100    ~50
Training plateau        Yes (epochs 80–130)   No      No

Table 1. Comparison of training dynamics across positional encoding schemes.

The sinusoidal model exhibits a plateau between epochs 80 and 130, suggesting that it passes through a more entangled intermediate representational state before finding a good solution. T5 eliminates this plateau and converges smoothly. RoPE converges roughly 4× faster than sinusoidal, reaching the noise floor by epoch ~50 with no plateaus or step transitions.

6.2 Intrinsic dimensionality across layers

We use the Levina-Bickel estimator at $k = 30$ to track how intrinsic dimensionality evolves through the network. The input data has an estimated dimension of 11.4.

Layer          Sinusoidal   T5     RoPE
Input (20D)    8.0          11.4   11.4
Layer 1        12.4         7.4    9.0
Layer 2        10.7         6.8    6.5
Layer 3        9.6          6.1    4.5
Layer 4        8.7          5.4    3.6
Layer 5        7.2          4.7    2.9
Layer 6        4.4          3.7    2.3
Layer 7        2.6          2.5    1.9
Output (20D)   1.2          1.6    1.6

Table 2. Levina-Bickel intrinsic dimension estimates ($k = 30$) at each layer for the three positional encoding schemes.

The sinusoidal model shows a characteristic expansion-then-compression pattern: Layer 1 increases dimensionality from 8.0 to 12.4 (projecting the 20D input into a 128D space that initially entangles the representations), followed by gradual compression through Layers 2–7. The major dimensionality squeeze occurs in the final two layers.

T5 eliminates the Layer 1 expansion entirely (7.4 vs 12.4), producing monotonic compression from the first layer onward. RoPE goes further: although its Layer 1 estimate is slightly higher than T5's (9.0 vs 7.4), it compresses more aggressively through the middle layers. By Layer 4, RoPE has already reached 3.6, a level the sinusoidal model does not achieve until Layer 6. The final-layer dimension of 1.9 for RoPE is the closest of the three schemes to the true 1D structure of each circular manifold.

6.3 UMAP visualisation


Figure 3. UMAP of intermediate representations for the sinusoidal model. The circles become progressively separated through the layers, but Layer 1 shows a tangled, high-dimensional state.


Figure 4. UMAP of intermediate representations for the T5 model. Circles begin separating earlier (Layer 2–3) with a smoother progression.


Figure 5. UMAP of intermediate representations for the RoPE model. Circles are well-separated by Layer 3–4, roughly 2–3 layers earlier than sinusoidal, with the tightest final clustering.

6.4 Effect of geometric overlap

We also trained the sinusoidal model on data with $d_{\text{sub}} = 4$, where the 10 circle planes are constrained to a shared 4D subspace. Despite the heavy geometric overlap (circles cannot be separated by the plane they occupy in any single time step), the model achieves nearly the same validation MSE (8.10 vs 8.06), demonstrating that the transformer can extract temporal structure (angular velocity differences, Markov transition patterns) to distinguish overlapping manifolds.


Figure 6. Final UMAP panels for sinusoidal model trained on non-overlapping ($d_{\text{sub}} = 20$, left) and overlapping ($d_{\text{sub}} = 4$, right) circle configurations.

7. Discussion

Our results reveal a clear hierarchy among positional encoding schemes for this task, which we believe reflects fundamental differences in inductive bias:

Sinusoidal encoding creates a representational bottleneck. By adding fixed absolute position signals to the input embeddings, the model must first disentangle position from content—leading to the observed dimensionality expansion at Layer 1 and the training plateau around epochs 80–130. The model effectively needs to "unlearn" absolute position before it can build useful relative representations.

Relative encodings bypass this bottleneck. Both T5 and RoPE encode position through the attention mechanism itself (via additive bias or Q/K rotation, respectively), leaving the input embeddings free to represent content from the start. This produces monotonic dimensionality compression and eliminates the training plateau.

RoPE's rotational structure may be especially well-suited to periodic data. The underlying signal consists of points traversing circles—fundamentally periodic, rotational dynamics. RoPE encodes position via rotations in the embedding space, which may provide a natural basis for representing such structure. This could explain why RoPE achieves both the fastest convergence and the tightest final compression ($\hat{m} = 1.9$ vs 2.5–2.6 for the other schemes). However, we note that this hypothesis requires further investigation with non-periodic datasets.

Practical implications. For continuous time series modelling, our results suggest that relative positional encodings should be preferred over absolute sinusoidal encodings. RoPE in particular offers a compelling combination of fast convergence, high reconstruction accuracy, and compact learned representations, at the cost of requiring a custom attention implementation. T5 bias offers a simpler implementation path (using standard attention masks) with most of the benefits.

8. Conclusion

We have demonstrated that positional encoding has a substantial effect on both training dynamics and representation geometry in masked time series transformers. Using a synthetic dataset with known manifold structure, we showed that relative positional encodings (T5 bias and RoPE) produce smoother training, faster convergence, and more monotonic dimensionality compression compared to absolute sinusoidal encoding. RoPE achieves the best performance across all metrics, reaching the noise floor in approximately 50 epochs (4× faster than sinusoidal) while compressing representations to an intrinsic dimension of 1.9—close to the true 1D structure of the underlying circular manifolds.

The Levina-Bickel intrinsic dimension estimator provides a quantitative lens on the layer-by-layer organisation of transformer representations, complementing qualitative UMAP visualisations. Our synthetic framework—with controllable dimensionality, geometric overlap, and noise level—may be useful for future studies of transformer representation learning.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need," NeurIPS, 2017.

[2] C. Raffel, N. Shazeer, A. Roberts, et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," JMLR, vol. 21, no. 140, pp. 1–67, 2020.

[3] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, 2024.

[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," NAACL-HLT, 2019.

[5] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," arXiv:1802.03426, 2018.

[6] E. Levina and P. J. Bickel, "Maximum likelihood estimation of intrinsic dimension," NeurIPS, 2004.